How to build a custom Named Entity Recognition System

January 25, 2022

Introduction

Named Entity Recognition (NER) is a Natural Language Processing task that identifies and classifies named entities, such as people, organizations, locations, and dates, in unstructured text. NER is a common preprocessing step for applications such as knowledge extraction, text classification, question answering, and sentiment analysis.

Several pre-trained NER models are available, but a custom NER system can be necessary when dealing with domain-specific entities or when a general-purpose model does not reach the precision and recall an application needs. In this blog post, we will explore the steps and best practices for building an NER system and compare some popular tools and frameworks for creating one.

Building a custom Named Entity Recognition System

The process of building a custom NER system typically involves the following steps:

Data Collection

The first step in building a custom NER system is to collect and label a dataset of text documents that contain the entities of interest. For instance, if the goal is to recognize disease names in medical records, the dataset should comprise medical documents with labeled disease names.
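As a rough illustration of what a labeled example can look like, the sketch below stores each entity as a character-offset span and runs a basic sanity check on the annotations. The DISEASE label, the example sentence, and the helper names find_span and validate are hypothetical; real annotation tools export their own formats.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def find_span(text: str, phrase: str, label: str) -> Span:
    """Locate the first occurrence of phrase and return its labeled span."""
    start = text.index(phrase)  # raises ValueError if the phrase is absent
    return (start, start + len(phrase), label)


def validate(text: str, spans: List[Span]) -> None:
    """Raise ValueError for out-of-range, whitespace-padded, or overlapping spans."""
    previous_end = 0
    for start, end, label in sorted(spans):
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"span out of range: {(start, end, label)}")
        if text[start:end] != text[start:end].strip():
            raise ValueError(f"span has surrounding whitespace: {(start, end, label)}")
        if start < previous_end:
            raise ValueError(f"overlapping span: {(start, end, label)}")
        previous_end = end


# Hypothetical medical-record sentence with two disease mentions.
text = "No history of asthma or pneumonia."
spans = [
    find_span(text, "asthma", "DISEASE"),
    find_span(text, "pneumonia", "DISEASE"),
]
validate(text, spans)
TRAIN_DATA = [(text, spans)]
```

Checks like these catch common annotation mistakes, such as off-by-one offsets, before they reach training.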

Data Preprocessing

The labeled dataset must then be converted from raw text into a format a model can train on. This process often involves tokenization, normalization, and feature extraction, and the entity labels must be aligned with the resulting tokens.
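One common way to make labeled text machine-readable is to tokenize it and convert entity spans into per-token BIO tags (B- for the first token of an entity, I- for the following tokens, O for everything else). The sketch below assumes character-offset annotations and uses a simple regular-expression tokenizer; both are illustrative choices, not a prescribed pipeline.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def tokenize(text: str) -> List[Tuple[str, int, int]]:
    """Split text into word and punctuation tokens, keeping character offsets."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def to_bio(text: str, spans: List[Span]) -> Tuple[List[str], List[str]]:
    """Return (words, tags), where tags follow the BIO scheme."""
    tokens = tokenize(text)
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        # Tokens that lie fully inside the annotated character span.
        inside = [i for i, (_, s, e) in enumerate(tokens)
                  if s >= start and e <= end]
        for n, i in enumerate(inside):
            tags[i] = ("B-" if n == 0 else "I-") + label
    return [tok for tok, _, _ in tokens], tags


# Hypothetical example sentence and label.
text = "The patient was diagnosed with type 2 diabetes."
start = text.index("type 2 diabetes")
words, tags = to_bio(text, [(start, start + len("type 2 diabetes"), "DISEASE")])
for word, tag in zip(words, tags):
    print(f"{word}\t{tag}")
```

A production tokenizer would normally come from the chosen framework, so that training and inference split text identically.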

Model Selection

Next, one needs to select a suitable model for training the NER system. The choice of model depends on the size and complexity of the dataset, the available computing resources, and how fast the model must run at inference time. Some popular models used for NER include Conditional Random Fields (CRF), Support Vector Machines (SVM), and Deep Learning models such as Recurrent Neural Networks (RNN) and Convolutional Neural Networks (CNN).
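Classical models such as CRFs and SVMs usually work on hand-crafted features for each token, whereas deep learning models learn their own representations. The sketch below shows what a per-token feature dictionary might look like. The particular features are illustrative assumptions, not a recommended set.

```python
from typing import Dict, List


def token_features(words: List[str], i: int) -> Dict[str, object]:
    """Build a feature dictionary for the token at position i."""
    word = words[i]
    return {
        "lower": word.lower(),
        "suffix3": word[-3:].lower(),
        "is_title": word.istitle(),
        "is_upper": word.isupper(),
        "is_digit": word.isdigit(),
        # Neighboring words give the model local context.
        "prev_lower": words[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": words[i + 1].lower() if i < len(words) - 1 else "<EOS>",
    }


def sentence_features(words: List[str]) -> List[Dict[str, object]]:
    """Feature dictionaries for every token in a sentence."""
    return [token_features(words, i) for i in range(len(words))]


features = sentence_features(["Aspirin", "treats", "pain", "."])
```

A list of such dictionaries per sentence, paired with a list of tags, is a typical input shape for CRF toolkits, though the exact format depends on the library.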

Training and Evaluation

Once a model is selected, one needs to train it on the labeled dataset and tune hyperparameters such as the learning rate, regularization strength, and choice of optimizer to improve the model's performance. Tuning and comparison against competing models should be done on a separate validation set; the final model is then evaluated once on a held-out test set, typically using entity-level precision, recall, and F1.
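NER models are commonly scored at the entity level: a prediction counts as correct only if both the span boundaries and the label match the gold annotation. The following is a minimal sketch of that computation over BIO tag sequences. The function names are hypothetical, and an I- tag without a matching B- tag is leniently treated as starting a new entity, which is one of several possible conventions.

```python
from typing import List, Set, Tuple


def extract_entities(tags: List[str]) -> Set[Tuple[int, int, str]]:
    """Collect (start, end_exclusive, label) spans from a BIO tag sequence."""
    entities = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            entities.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return entities


def precision_recall_f1(gold: List[List[str]],
                        pred: List[List[str]]) -> Tuple[float, float, float]:
    """Entity-level precision, recall, and F1 over parallel lists of sentences."""
    gold_ents, pred_ents = set(), set()
    for n, (g, p) in enumerate(zip(gold, pred)):
        # Prefix each span with the sentence index so spans stay distinct.
        gold_ents |= {(n, *e) for e in extract_entities(g)}
        pred_ents |= {(n, *e) for e in extract_entities(p)}
    tp = len(gold_ents & pred_ents)
    precision = tp / len(pred_ents) if pred_ents else 0.0
    recall = tp / len(gold_ents) if gold_ents else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical gold and predicted tags for one sentence:
# the model finds the first entity but misses the second.
gold = [["O", "B-DISEASE", "I-DISEASE", "O", "B-DISEASE"]]
pred = [["O", "B-DISEASE", "I-DISEASE", "O", "O"]]
precision, recall, f1 = precision_recall_f1(gold, pred)
```

In this example the model's single prediction is correct (precision 1.0) but it finds only one of two gold entities (recall 0.5).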

Deployment

Finally, the trained model can be deployed for use in applications requiring NER.

Popular Tools and Frameworks for building a custom NER system

There are several tools and frameworks available for building a custom NER system. Here, we compare some popular ones:

spaCy

spaCy is an open-source Python library designed for NLP tasks such as dependency parsing, POS tagging, and NER. Its official pre-trained pipelines are general-purpose and trained largely on news and web text; models for specialized domains such as medicine or law generally come from third-party projects or from custom training. spaCy's NER component is a statistical model, and it can be combined with the rule-based EntityRuler component for pattern-driven matches. spaCy also ships displaCy, a visualizer for inspecting the entities a model recognizes.
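As a small illustration of the rule-based side, the sketch below adds an entity ruler to a blank English pipeline and matches two hand-written patterns. It assumes spaCy v3 is installed; the DISEASE label, the patterns, and the sentence are made up for the example. A blank pipeline has no trained components, so no model download is required.

```python
import spacy

# A blank English pipeline: tokenizer only, no statistical components.
nlp = spacy.blank("en")

# Add the rule-based entity ruler and register illustrative patterns.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # Exact phrase match.
    {"label": "DISEASE", "pattern": "type 2 diabetes"},
    # Token pattern matching "asthma" regardless of case.
    {"label": "DISEASE", "pattern": [{"LOWER": "asthma"}]},
])

doc = nlp("The patient has type 2 diabetes and Asthma.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

Rules like these can bootstrap a system or cover entities with a fixed vocabulary; a trained statistical component is usually still needed to generalize to unseen mentions.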

Natural Language Toolkit (NLTK)

NLTK is a popular Python library for NLP tasks like tokenization, stemming, and sentiment analysis. It includes a pre-trained named entity chunker (ne_chunk) as well as classifiers such as Naive Bayes, Maximum Entropy, and Decision Trees that can be used to build custom NER taggers. NLTK also provides access to labeled corpora for training and evaluating NER models.

Stanford Named Entity Recognizer (Stanford NER)

Stanford NER is a Java-based tool for NER developed by Stanford University. It provides pre-trained models for several languages, including English, German, and Spanish. Stanford NER is an implementation of linear-chain Conditional Random Field (CRF) sequence models, which are statistical rather than rule-based. Users can also train their own models on labeled data to recognize user-defined entity types.

AllenNLP

AllenNLP is an open-source NLP library built on PyTorch and developed by the Allen Institute for Artificial Intelligence. It provides tools for building custom NER systems using deep learning models such as RNNs and CNNs. AllenNLP provides pre-trained models for NER and allows fine-tuning them for specific tasks and domains.

Conclusion

Building a custom named entity recognition system can be challenging, but it is a worthwhile effort when domain-specific entities or higher precision and recall are required. The choice of tools and frameworks depends on several factors, such as the size and complexity of the dataset, the available computing resources, and how fast the model must run. We have compared some popular options for building an NER system, including spaCy, NLTK, Stanford NER, and AllenNLP.
